Univariate Plots Section

Dataset description

## [1] 4898   13

Our data set consists of 13 variables, with 4,898 observations.

The data set is related to white “Vinho Verde” Portuguese wine.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Description of attributes:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily);
  2. volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  5. chlorides: the amount of salt in the wine
  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content -> density might be related to alcohol and sugar
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant -> can be related to (7) or (6)
  11. alcohol: the percent alcohol content of the wine

This report explores a data set containing quality and attributes for 4,898 white wines.

Let’s first transform the quality variable into a factor variable. Let’s also keep a numeric copy of the quality variable so that we can still make plots that need numerical data.

Check that quality is now a factor variable:

## [1] TRUE

There are 4898 observations spread into seven quality categories, from 3 (low quality) to 9 (good quality).
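
The conversion described above can be sketched as follows. This is a minimal illustration rather than the original chunk: the quality vector is rebuilt from the category counts reported in this section (the real data frame is not loaded here), and quality_scores is a hypothetical helper name. The name quality.nr is the one the report uses for the numeric copy.

```r
# Rebuild the quality scores from the counts reported in this section
# (a stand-in for the real data frame, which is not loaded here).
quality_scores <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Keep a numeric copy for plots and summaries that need numbers.
quality.nr <- as.numeric(quality_scores)

# Convert the rating itself into a factor with one level per score.
quality <- factor(quality_scores, levels = 3:9)

is.factor(quality)   # check that the conversion worked
nlevels(quality)     # number of quality categories
table(quality)       # counts per category
```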

Understand White Wine quality

Let’s investigate how many observations are in each quality category.

Counts in each category:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Let’s better visualize how many counts are in each category.

From the above bar chart we can see that most of our white wines have a quality of six, with almost half of the observations (2198). Next are the wines rated five, with 1457 samples, and seven, with 880 samples. There are very few samples of very good white wines: only five were rated nine. On the other side, only 20 were poor, with a rating of three. This can bias our results.


Quality Percentages and Proportions

To better get a feel of the quality variable, let’s take a look at the percentage of each category from total.

These are the actual numbers:

##          3          4          5          6          7          8 
##  0.4083299  3.3278889 29.7468354 44.8754594 17.9665169  3.5728869 
##          9 
##  0.1020825

We can see that the data set is not well balanced in relation to how many observations are in each quality category. This can bias our results and analysis. The optimal data set would include almost the same number of observations in each category. The middle category, 6, has almost half of the observations, 44.88%. The most extreme categories, 3 and 9, together account for only 0.51% of the total number of observations. Qualities 4 and 8 each account for around 3.5% of the data set, while quality 5 accounts for 29.75% and quality 7 for 17.97%.
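
As a quick check of these shares, the percentages can be recomputed from the counts shown earlier. This is only a sketch: the counts are typed in by hand, and counts and pct are hypothetical names.

```r
# Category counts as reported in the quality table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Share of each category, in percent.
pct <- 100 * prop.table(counts)
round(pct, 2)

# Combined share of the two extreme categories (3 and 9).
extreme_share <- pct[["3"]] + pct[["9"]]
round(extreme_share, 2)
```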

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The median of our data set is quality 6, which makes sense since this category accounts for almost half of the observations. The average quality is slightly lower, at 5.88. This is because more observations have a quality below 6 (33.48%) than above it (21.64%).


Create only 3 levels of Quality

Let’s create only three levels of quality:

  1. low for quality 3, 4 and 5;
  2. medium for quality 6;
  3. premium for quality 7, 8 and 9.
  • For this I will create a new factor variable in the data set called quality.type.
  • Then, I will map the initial levels to the new ones: low, medium and premium.
  • This will create more balanced categories and will let us better understand the other variables and how they can contribute to classifying the quality of the wine.

Now we have this new variable with only three levels to categorize the white wine quality.

## [1] "low"     "medium"  "premium"

##      low   medium  premium 
## 33.48305 44.87546 21.64149

Although this is not optimal, we can now see that these three categories are more balanced with:

  • 33.48% low quality
  • 44.88% medium quality
  • 21.64% premium quality

Using these categories, we will try to understand important features for the wine quality classification.
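
One way to build such a three-level factor is with cut(). The sketch below is an assumption about the approach, not the original code; it again rebuilds the scores from the reported counts so that it runs on its own.

```r
# Quality scores rebuilt from the reported counts (stand-in for the data set).
quality.nr <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed by default: (2,5], (5,6], (6,9].
quality.type <- cut(quality.nr,
                    breaks = c(2, 5, 6, 9),
                    labels = c("low", "medium", "premium"))

levels(quality.type)
type_pct <- 100 * prop.table(table(quality.type))
round(type_pct, 2)
```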


Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

From the above histogram we can see that the distribution of fixed.acidity looks normal. There are some slightly extreme points of fixed acidity on the right of the distribution. The peak of our distribution is around 6.8 g/dm^3. Fifty percent of the wines have a fixed acidity that ranges from 6.3 to 7.3 g/dm^3. Let’s facet the data to see how it looks based on quality.

The premium category doesn’t have wines with a fixed acidity above 9.2 g/dm^3, and its average of 6.73 g/dm^3 is lower than that of the medium and low quality white wines. Other than that, the distributions look fairly normal, with some extreme values in the low and medium quality white wines.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.400   6.800   6.962   7.500  11.800 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.838   7.300  14.200 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.700   6.725   7.200   9.200

If we zoom in, we can see that the average and median fixed acidity are lowest for the premium quality white wines.
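
The per-category summaries in this section have headers of the form "pf$quality.type: low", which is what by() prints. A minimal sketch of that pattern is shown here on a tiny made-up data frame; the values are illustrative only, and only the names pf, fixed.acidity and quality.type come from the report.

```r
# Toy data frame with invented values; only the column names follow the report.
pf <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 6.2, 6.6, 5.8, 6.9, 6.4),
  quality.type  = factor(rep(c("low", "medium", "premium"), each = 3),
                         levels = c("low", "medium", "premium"))
)

# Six-number summary of fixed acidity within each quality type.
by(pf$fixed.acidity, pf$quality.type, summary)

# The same idea for a single statistic, returned as a named vector.
group_medians <- tapply(pf$fixed.acidity, pf$quality.type, median)
group_medians
```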


Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The volatile acidity histogram is pulled to the right by bigger values, with a mean of 0.278 g/dm^3, higher than the median value of 0.26 g/dm^3. In this data set, white wines have a volatile acidity ranging from 0.08 to 1.10 g/dm^3, with the middle 50% between 0.21 and 0.32 g/dm^3.

Transforming our data with a log 10 scale, we can see that our data is much more normal.
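
To illustrate why a log10 scale helps with a right-skewed variable, here is a small sketch on simulated data. The sample is drawn from a lognormal distribution chosen only for illustration (it is not the wine data), and skewness() is a hypothetical helper defined in the block.

```r
set.seed(42)

# Simulated right-skewed sample (an assumption, not the wine data).
x <- rlnorm(1000, meanlog = log(0.26), sdlog = 0.4)

# Simple moment-based sample skewness; 0 means symmetric.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)          # clearly positive: long right tail
skew_log <- skewness(log10(x))   # much closer to zero after the transform
c(raw = skew_raw, log10 = skew_log)
```

On the real data the same comparison could be made by replacing x with the volatile.acidity column.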

Let’s facet wrap our data to better see what is happening per quality and apply a log scale.

The biggest values of volatile acidity are in the low quality white wines, which also have the highest median, 0.29 g/dm^3. Although the values for the medium quality white wines are more spread out, ranging from 0.08 to 0.965 g/dm^3, than those for premium wines, ranging from 0.08 to 0.76 g/dm^3, the two groups share the same median value of 0.25 g/dm^3.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1000  0.2400  0.2900  0.3103  0.3500  1.1000 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2653  0.3200  0.7600

We can see here that low quality white wines have a higher median volatile acidity, while medium and premium quality white wines share the same median.


Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

We can see that there are some extreme points in our data set, which skew the distribution to the right. In particular, we can see a maximum value of 1.66 g/dm^3 citric acid.

Based on the above histograms, we can identify that the medium quality white wines have the most extreme data points but its median value of 0.32 is the same as the low quality white wines, slightly bigger than premium white wines which have a median value of 0.31 g/dm^3.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2400  0.3200  0.3343  0.4100  1.0000 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.320   0.338   0.380   1.660 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.2800  0.3100  0.3261  0.3600  0.7400

We can see that low quality wines are more spread out, but there is no big difference in the median and mean values.


Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

We can see that the distribution of residual sugar is right skewed with a long tail. In fact, 25 percent of the data points range from 9.9 g/dm^3 up to some very sweet wines with 65.8 g/dm^3. A white wine has, on average, 6.4 g/dm^3 residual sugar.

The transformed data shows a bimodal distribution, with one peak at around 2 and the other at around 10. Perhaps this is because we have both dry wines and sweeter wines. It might be a good idea to categorize these values and check whether this has an impact on the quality.

We can see that this bimodal distribution persists at the quality level. The low quality category has sweeter white wines, with a median value of 6.63 g/dm^3. Although the medium quality category contains the sweetest white wine, with 65.8 g/dm^3 residual sugar, its median value is not bigger than that of the low quality white wines. Premium quality white wines are more compact and have less residual sugar, ranging from 0.8 to 19.25 g/dm^3, with a median value of 3.88 g/dm^3.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   6.625   7.054  11.025  23.500 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.800   3.875   5.262   7.400  19.250

Here we can better see the extreme data points in the medium quality white wines. Also, we can see that the low quality wines have a bigger interquartile range. In contrast, the premium quality white wines have a smaller range of values and a slightly smaller median residual sugar value.

Sugar Levels

Let’s cut the residual.sugar variable into five categories.

There are 5 white wines categories:

  1. Extra Dry: 0-5 g/dm^3
  2. Dry: 5-10 g/dm^3
  3. Semi-Dry: 10-20 g/dm^3
  4. Semi-Sweet: 20-30 g/dm^3
  5. Sweet: 30-100+ g/dm^3

Note: the pH levels created later in this section use the bin edges 2.72, 3.09, 3.18, 3.28 and 3.82, with the labels ‘high’, ‘mod_high’, ‘medium’ and ‘low’.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
## 
##  extra_dry        dry   semi-dry semi_sweet      sweet 
##       2410       1295       1175         15          3

We can see that most wines are extra dry and only a few are in the sweeter categories.
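
The five sugar categories listed above can be produced with cut(). The sketch below uses a few made-up residual sugar values; the variable name sugar.levels and the right-closed handling of the boundaries (a value of exactly 5 counts as extra dry) are assumptions, since the original chunk is not shown.

```r
# Invented residual sugar values (g/dm^3), for illustration only.
residual.sugar <- c(1.2, 4.9, 5.0, 7.3, 12.5, 19.9, 22.0, 31.6, 65.8)

# Breaks follow the category list above; intervals are right-closed.
sugar.levels <- cut(residual.sugar,
                    breaks = c(0, 5, 10, 20, 30, Inf),
                    labels = c("extra_dry", "dry", "semi-dry",
                               "semi_sweet", "sweet"))

sugar_counts <- table(sugar.levels)
sugar_counts
```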


Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Most of the white wines have a chloride level between 0.036 and 0.05 g/dm^3. The average chloride level in white wine is 0.046 g/dm^3, slightly bigger than the median value because of the white wines with higher chloride levels.

If we look at the chlorides distribution for each quality category, we can see that the premium quality chlorides histogram has fewer extreme data points, with a mean chloride level of 0.038 g/dm^3.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05144 0.05300 0.34600 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500

We can better see here how the median and average chloride values are lower for medium and premium white wines.


Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

We can see that the values of free sulfur dioxide for the low quality white wines are more spread out, ranging from 2 to 289 mg/dm^3. The medium and premium quality white wines do not reach such high levels, with a maximum of 112 mg/dm^3 for medium quality white wines and 108 mg/dm^3 for premium quality white wines.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   20.00   34.00   35.34   49.00  289.00 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   24.00   34.00   35.65   46.00  112.00 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.55   42.00  108.00

From these box plots we can see that although the low quality category contains more wines with very high free sulfur dioxide, the three categories share almost the same median and almost the same average.


Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

There are some extreme white wines with more total sulfur dioxide, up to a maximum of 440 mg/dm^3.

We can see that the most extreme values for total sulfur dioxide are in the low quality white wines.

We can see that the medium and low quality wines have more spread out data. The median and the average values for total sulfur dioxide are lower for premium white wines and higher for medium and low quality wines. I will analyse this variable further in the next section.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   117.0   149.0   148.6   182.0   440.0 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0   107.2   132.0   137.0   164.0   294.0 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.2   146.0   229.0

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The density distribution ranges from a minimum of 0.987 g/cm^3 to a maximum of 1.039 g/cm^3, with an average almost equal to the median value of around 0.994 g/cm^3. This suggests that the bulk of the data is roughly symmetric and close to normally distributed.

We can see here that most of our premium quality white wines have lower density, with fewer observations above 0.994 g/cm^3. We can also see two peaks for premium quality white wines, one near 0.992 and the other around 0.997 g/cm^3.

From the box plots we can see that the mean and the median density values decrease with each quality level. Density might be a good feature to predict good wine quality. I will focus on density in the next parts.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9932  0.9951  0.9952  0.9971  1.0024 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9905  0.9917  0.9924  0.9936  1.0006

Density Levels

Let’s cut the density variable to create a density.levels factor with two levels: low_density, high_density.

## 
##  low_density high_density 
##         2442         2456

From the above bar charts it seems that premium quality white wines fall in the low density group more often than low quality white wines do.
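
A two-level split like this can be sketched with cut(). The cut point is not stated in the report; the example below assumes a split at the median, which would be consistent with the two groups being almost equal in size, and uses made-up density values. The names density_values and cut_point are hypothetical.

```r
# Invented density values (g/cm^3), for illustration only.
density_values <- c(0.9871, 0.9905, 0.9917, 0.9937,
                    0.9940, 0.9961, 1.0010, 1.0390)

# Assumed cut point: the median of the variable.
cut_point <- median(density_values)

# Values at or below the cut point are low density, the rest high density.
density.levels <- cut(density_values,
                      breaks = c(-Inf, cut_point, Inf),
                      labels = c("low_density", "high_density"))

density_counts <- table(density.levels)
density_counts
```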


pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH distribution looks quite normal, with the median value of 3.18 almost equal to the average of 3.19.

There are some extreme data points in each category, with slightly bigger values for the premium category.

We can see that both the medians and the means increase slightly with each quality level.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.79    3.08    3.16    3.17    3.24    3.79 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.215   3.320   3.820

pH Levels

Let’s create a pH.levels factor variable with four acidity levels:

  1. High: Lowest 25% of pH values
  2. Moderately High: 25% - 50% of pH values
  3. Medium: 50% - 75% of pH values
  4. Low: 75% - max pH value
## 
##     high mod_high   medium      low 
##     1314     1246     1156     1182
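
The quartile-based levels can be sketched with cut(), using the pH minimum, quartiles and maximum from the summary above as bin edges. The pH values below are made up, and the use of include.lowest = TRUE (so that the minimum falls in the first bin) is an assumption about the original code. Note that the labels describe acidity, so the lowest pH values get the label high.

```r
# Invented pH values, for illustration only.
pH <- c(2.72, 3.00, 3.09, 3.15, 3.18, 3.22, 3.28, 3.40, 3.82)

# Bin edges: minimum, 1st quartile, median, 3rd quartile and maximum of pH.
bin_edges <- c(2.72, 3.09, 3.18, 3.28, 3.82)

# Acidity labels: low pH means high acidity.
bin_names <- c("high", "mod_high", "medium", "low")

pH.levels <- cut(pH, breaks = bin_edges, labels = bin_names,
                 include.lowest = TRUE)

pH_counts <- table(pH.levels)
pH_counts
```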

Average quality Ratings by Acidity Levels

## pf$pH.levels: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.799   6.000   8.000 
## -------------------------------------------------------- 
## pf$pH.levels: mod_high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.784   6.000   8.000 
## -------------------------------------------------------- 
## pf$pH.levels: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.889   6.000   9.000 
## -------------------------------------------------------- 
## pf$pH.levels: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   6.053   7.000   9.000

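We can see that the low acidity group, which holds the highest pH values, has the highest average quality (6.05), while the high and moderately high acidity groups average about 5.8. This is consistent with premium white wines having slightly higher pH values than medium and low quality white wines.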


Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The sulphates histogram of white wines is right skewed, with the median, 0.47, being smaller than the average sulphates of 0.49 g/dm^3. The values range from a minimum of 0.22 to a maximum of 1.08 g/dm^3.

The transformed data looks more normally distributed, with the bulk of our data between 0.41 and 0.55 g/dm^3.

We can see that the premium quality white wine sulphates values are more spread out, but their median value is equal to that of the medium quality wines, at 0.48 g/dm^3, slightly higher than that of the low quality white wines, which have a median value of 0.47 g/dm^3.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.4100  0.4700  0.4815  0.5300  0.8800 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4000  0.4800  0.5001  0.5800  1.0800

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

We can see that the alcohol values range from a minimum of 8% to a maximum of 14.2%. On average, a white wine has 10.5% alcohol.

Here we can see that three quarters of the low quality white wines have an alcohol level of at most 10.4%, while the same share of medium quality white wines is at or below 11.4% and of premium quality white wines at or below 12.4%.

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.20    9.60    9.85   10.40   13.60 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   10.70   11.50   11.42   12.40   14.20

We can see that as white wine quality increases, the median and mean alcohol content increase. This might indicate a relation between alcohol and quality. Premium quality white wines have, on average, more alcohol than the low and medium quality types. Alcohol might be a good predictor of wine quality and needs a closer look.


Alcohol Levels

Let’s cut alcohol into groups to see which ones receive better ratings.

The three broad groups of wine samples are listed below. Note that the counts that follow use six levels: low, low_mod, moderate, mod_high, high and highest.

  1. low < 11%
  2. moderate < 13%
  3. high > 13%
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
## 
##      low  low_mod moderate mod_high     high  highest 
##      502     1583     1252      850      609      102

Here we can see that premium quality white wines are more frequent at the higher alcohol levels.


What is the structure of your dataset?

There are 4,898 white wines in our data set with 11 input variables and one output variable (based on sensory data):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
  12. quality (score between 0 and 10)
  • There are two integer variables: X and quality, and the rest of the variables are numeric.
  • I transformed the quality variable into a factor variable.
  • I created a new variable in the data set quality.nr which is a numeric variable.

  • Most of the white wines have a quality of 6, which is the medium quality type (45%). Qualities 3, 4 and 5 make up the low type (33%), and qualities 7, 8 and 9 the premium type (22%).
  • Fixed acidity has a normal distribution.
  • If we transform volatile acidity with a log10 scale, the distribution looks normal.
  • If we transform residual sugar with a log10 scale, it has a bimodal distribution for dryer and sweeter wines. The median value for a premium white wine is 3.88, while for a low quality white wine is 6.63 g/dm^3.
  • The mean and the median density values decrease per each quality.type level.
  • pH distribution looks normal with a median value of 3.18. Premium quality wines have, on average, a higher pH.
  • As white wine quality increases, the median and mean alcohol content increases.
  • On average, a white wine has 10.5% alcohol.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are alcohol and density. I suspect that alcohol and/or density and some combination of the other variables can be used to build a predictive model for white wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Fixed acidity, volatile acidity, residual sugar, free sulfur dioxide and pH may help in determining quality.

Did you create any new variables from existing variables in the dataset?

  1. From the quality variable I created two other variables. First, a numeric variable, quality.nr. Second, a factor variable called quality.type with three levels: low, medium and premium. In the low category I included the 3, 4, 5 quality categories. In the medium, the 6 quality category and in the premium, the 7, 8 and 9 quality levels.
  2. Sugar levels, factor variable with five levels: extra_dry, dry, semi-dry, semi_sweet, sweet. This variable is cut from residual.sugar because, after applying a log scale, we saw a bimodal distribution, and there may be patterns related to quality, based on sugar levels, that can be further analysed.
  3. Density levels, factor with two levels: low_density, high_density. This variable is cut from the density variable in order to better get a sense of how low or high density relates to white wine quality.
  4. pH levels, factor variable with four levels: high, moderately high, medium and low.
  5. Alcohol levels, factor variable with six levels: low, low_mod, moderate, mod_high, high, highest.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

For example, I saw that after transforming residual sugar with a log scale, the distribution becomes bimodal. This might indicate that there are categories of white wine, from dryer to sweeter, with less residual sugar in the premium quality white wines.


Bivariate Plots Section

Scatterplot Matrix and Correlations

From this matrix we can better see the relationships between each pair of variables in our data set. Let’s also take a look at the correlation coefficient of quality versus the rest of the variables.

##                              [,1]
## fixed.acidity        -0.113662831
## volatile.acidity     -0.194722969
## citric.acid          -0.009209091
## residual.sugar       -0.097576829
## chlorides            -0.209934411
## free.sulfur.dioxide   0.008158067
## total.sulfur.dioxide -0.174737218
## density              -0.307123313
## pH                    0.099427246
## sulphates             0.053677877
## alcohol               0.435574715

From the above matrix we can see that quality has the highest Correlation Coefficient with:

  1. alcohol, 0.44 which might indicate a moderate positive linear relationship
  2. density, -0.31 which might indicate a moderate negative linear relationship
  3. chlorides, volatile.acidity and total.sulfur.dioxide, which all might indicate a weak negative linear relationship.

We also have to pay attention to the fact that alcohol and density are correlated with each other. They have a correlation coefficient of -0.78, which might indicate a strong negative linear relationship between them. When building a linear regression model, we have to be careful about multicollinearity, which occurs when independent variables are strongly correlated with each other. Density and residual sugar have a correlation coefficient of 0.84, which might indicate a strong positive linear relationship between them.
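
The two kinds of check described here, correlations of each feature with quality and correlations between features, can be sketched with cor(). The data below is simulated purely for illustration (the coefficients in the simulation are arbitrary assumptions, and wine is a hypothetical data frame name), so the resulting numbers are not those of the report.

```r
set.seed(1)
n <- 200

# Simulated features: density falls as alcohol rises (toy relationship).
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)

# Simulated quality score, loosely increasing with alcohol, clamped to 3..9.
quality.nr <- round(pmin(9, pmax(3, 0.5 * alcohol + rnorm(n, mean = 0.6))))

wine <- data.frame(alcohol, density, quality.nr)

# Correlation of each feature with quality (one column, one row per feature).
quality_cor <- cor(wine[, c("alcohol", "density")], wine$quality.nr)
quality_cor

# Correlation between the two predictors: a large absolute value is a
# warning sign for multicollinearity in a linear regression.
r_alcohol_density <- cor(wine$alcohol, wine$density)
r_alcohol_density
```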

Let’s take a closer look at these variables.


Quality and Alcohol

Looking at this scatter plot we can see the moderate linear relationship between the two variables. The median and average alcohol are increasing as the quality increases. Better white wines tend to have more alcohol content.

Here, we can see that medium quality white wines have the highest variance of alcohol, while low quality white wines typically fall below 12% alcohol. Although there are premium white wines with alcohol less than 10%, the majority are above this threshold.

Quality and Density

Here we can see some extreme values for density in the medium quality white wines. Let’s focus on the bulk of our data and see what is happening.

We can see that for high quality white wines the density has, on average, lower values.

Low values of density are more common for high quality white wines.


Quality and Chlorides

The widest spread of chlorides, with the biggest values, is found in the low and medium quality white wines. Based on the above graphs, chlorides are, on average, higher in low quality white wines. In other words, good quality wines are less salty.


Density versus Alcohol

Let’s take a look also at two correlated features: Density and Alcohol.

Here we can see the strong negative relationship between density and alcohol. We can see that as alcohol increases, density tends to decrease.

Low density white wines have higher variance and greater alcohol content.

Here we can better break down our density-alcohol relationship. We can see how low alcohol levels in white wine relate to higher density values.


Bar Charts of Mean Quality

Residual Sugar

## # A tibble: 3 x 14
##   quality.type mean_alcohol median_alcohol min_alcohol max_alcohol
##   <fct>               <dbl>          <dbl>       <dbl>       <dbl>
## 1 low                  9.85            9.6         8          13.6
## 2 medium              10.6            10.5         8.5        14  
## 3 premium             11.4            11.5         8.5        14.2
## # ... with 9 more variables: mean_density <dbl>, median_density <dbl>,
## #   min_density <dbl>, max_density <dbl>, mean_pH <dbl>, median_pH <dbl>,
## #   min_pH <dbl>, max_pH <dbl>, n <int>
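A grouped summary of this shape could be produced in several ways. The sketch below uses base R `tapply()` on a tiny made-up data frame `toy`; the column names mirror the output above, but the values are invented and the helper `summarise_by` is hypothetical.

```r
# Tiny made-up sample (hypothetical values, NOT the wine data set).
toy <- data.frame(
  quality.type = factor(c("low", "low", "medium", "medium", "premium", "premium"),
                        levels = c("low", "medium", "premium")),
  alcohol = c(9.0, 10.0, 10.0, 11.0, 11.0, 12.5)
)

# Hypothetical helper: apply a summary function to alcohol within each group
summarise_by <- function(f) as.vector(tapply(toy$alcohol, toy$quality.type, f))

by_type <- data.frame(
  quality.type   = levels(toy$quality.type),
  mean_alcohol   = summarise_by(mean),
  median_alcohol = summarise_by(median),
  min_alcohol    = summarise_by(min),
  max_alcohol    = summarise_by(max),
  n              = summarise_by(length)
)
print(by_type)
```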


Quality and Sulphates

We can see that the sulphates variance is bigger for medium and premium quality white wines.


Sugar versus Density

Sugar and density have a strong positive relationship, with a correlation coefficient of 0.84.

Residual sugar has some bigger values. Let’s focus on the bulk of our data to better understand this relationship.

In the above scatter plot we can see the strong linear relationship between density and residual sugar. As density increases, residual sugar tends to increase in white wines. It is also visible that most of our white wine samples have low sugar content.

High density white wines have a higher variance of residual sugar, while low density white wines are more compact, typically, with less than 10 g/dm^3.

The density of white wine varies in relation to its sugar content, from dryer white wines with lower density to sweeter white wines with higher density. It will be interesting to see how these trends relate to quality. We will look at this more closely in the next sections of the analysis. Also, we can see that there aren’t many observations in our semi-sweet and sweet white wine categories.


Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

First, alcohol has a correlation coefficient of 0.44 with quality, which indicates a moderate linear relationship (about 19% of the variance in quality). The median and average alcohol content increase slightly as the quality increases. Better wines tend to have more alcohol content. If we cut alcohol into a categorical variable from the lowest to the highest alcohol levels, we can clearly see that low quality white wines have fewer observations at higher alcohol content.

On the other hand, density has a correlation coefficient of only -0.31 with quality, which indicates a weak to moderate negative relationship (about 10% of the variance in quality). Low values of density are more common for high quality white wines. If we cut density into two categories, low density and high density, we can see that higher values of density are associated more with lower quality in white wine, typically below 6. Density also has a strong positive linear correlation with residual sugar, with a correlation coefficient of 0.84.

Another interesting observation about alcohol and density is that they also correlate with each other, which may cause multicollinearity issues if we use both variables in a linear model. There is a strong negative linear relationship, with a correlation coefficient of -0.78. Low alcohol levels in white wine relate to high density values.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Sugar and density correlate with each other. The density of white wine varies in relation to its sugar content, from dryer white wines with lower density to sweeter white wines with higher density. High density white wines have a higher variance of residual sugar, while low density white wines are more compact, typically with less than 10 g/dm^3.

What was the strongest relationship you found?

White wine quality is positively and moderately correlated with alcohol content. The strongest relationship I found is between density and sugar with a Correlation Coefficient of 0.84 which indicates a strong positive linear relationship.
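A correlation coefficient and the share of variance explained are different quantities: the second is the square of the first. The short sketch below squares the coefficients reported in this section; the vector name `r` and its element names are chosen only for this illustration.

```r
# Correlation coefficients as reported in this section
r <- c(quality_alcohol = 0.44, quality_density = -0.31,
       alcohol_density = -0.78, density_sugar = 0.84)

# Share of variance explained by a simple linear fit is r squared
variance_share <- round(r^2, 2)
print(variance_share)

# The strongest relationship is the one with the largest absolute coefficient
strongest <- names(r)[which.max(abs(r))]
print(strongest)
```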

Multivariate Plots Section

Quality and Alcohol by Alcohol Levels

These histograms show the distribution of alcohol in white wines by alcohol level, for each quality category. While the distribution for medium white wines looks closer to normal, the distribution for low quality white wines is skewed to the right, with more wines at lower alcohol levels. The bulk of the data for premium quality white wines is shifted toward the right, at higher alcohol content.

The previous histograms of alcohol by alcohol level show more clearly how each Quality Type is structured in relation to alcohol. This scatter plot further supports our intuition that alcohol levels play a role in determining quality.

If we plot the median line for quality versus alcohol by each alcohol level, we can see that the variance is quite large. Based on the median line, white wines with the lowest alcohol content don’t have a rating higher than 6, while white wines at the highest alcohol level don’t have a quality lower than 6.

Quality and Density by Density Levels

While quality 6 has slightly more low density white wines, qualities 3, 4 and 5 have more high density white wines, and qualities 7 and 8 have more low density white wines.

Chlorides by Quality Types

Here we can see more clearly that chlorides are lower for the premium white wines.

Alcohol and Density

Quality by Density Levels for each Alcohol Level

The same pattern appears here: high density white wines have lower alcohol content.

These density plots show the distribution of white wine quality for each alcohol level, by high and low density. We can see that white wines with moderate alcohol have roughly the same amount of low and high density observations in each Quality Type, with slightly lower density in higher quality white wines. What is important to note is that there are both low and high density white wines in the poorer quality ratings, more specifically 4 and 5, and this is more pronounced in wines with less alcohol content. This may be due to the fact that quality and alcohol are positively correlated while alcohol and density are negatively correlated.

It is easier to see here how premium white wines have more low density observations as opposed to high density.

Here we can see that although it is likely to find low alcohol white wines with low density, it is less likely to find higher alcohol white wines with high density.

Sugar, Density and White Wine quality

High density white wines tend to be sweeter. We can see that most low density white wines are dryer while high density white wines are sweeter.

Here we can see the same pattern for each quality rating. Higher density combined with higher sugar levels is found more often in medium and low quality white wines.

Sugar versus Alcohol and Quality

We can see that residual sugar does account for some of the variance in quality: premium quality white wines tend to have less sugar and higher alcohol levels.


Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Of the features that I have looked into, alcohol seems to be the most important in determining white wine quality, in combination with various other features. Based on the median value, white wines at the lowest alcohol levels don’t have a quality rating higher than 6, while white wines at the highest alcohol levels don’t have a quality lower than 6. The more moderate alcohol levels range from quality 5 to 7. Medium quality white wines don’t vary much based on their density level. The difference comes in the more extreme Quality Types, with lower quality white wines having higher density and higher quality white wines having lower density. It is interesting to note that white wines with moderate alcohol content have similar density levels in each Quality Type. Higher alcohol levels have more low density observations. Although it is likely to find low alcohol white wines with low density, it is less likely to find higher alcohol white wines with high density.

Were there any interesting or surprising interactions between features?

High density white wines tend to have lower alcohol content and they tend to be sweeter. Also, premium white wines tend to have less sugar.

Linear Regression

Let’s see how much of White Wine quality is dependent on the input variables by fitting a regular linear model.

## 
## Call:
## lm(formula = I(quality.nr) ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.502e+02  1.880e+01   7.987 1.71e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

We can see from the regression summary that all the input variables together explain 28.03% of the variance in white wine quality (adjusted R-squared). Based on the p-values, the features that contribute significantly to predicting white wine quality are: fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, density, pH, sulphates and alcohol.
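The adjusted R-squared and the p-values can also be read programmatically from a fitted model. The sketch below fits `lm()` on synthetic data (the data frame `toy`, the column `noise_var`, and the 0.05 threshold are assumptions made for illustration, not the model fitted above).

```r
# Synthetic data (hypothetical, NOT the wine data set).
set.seed(7)
n <- 300
toy <- data.frame(alcohol = rnorm(n, mean = 10.5, sd = 1.2),
                  noise_var = rnorm(n))
# quality.nr depends on alcohol only; noise_var is unrelated by construction
toy$quality.nr <- 3 + 0.3 * toy$alcohol + rnorm(n, sd = 0.8)

fit <- lm(quality.nr ~ alcohol + noise_var, data = toy)
s <- summary(fit)

# Share of variance explained, penalised for the number of predictors
adj_r2 <- s$adj.r.squared

# p-values sit in the "Pr(>|t|)" column of the coefficient table
p_values <- coef(s)[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05 & names(p_values) != "(Intercept)"]

print(round(adj_r2, 4))
print(significant)
```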


Final Plots and Summary

Plot One

Description One

The first bar chart shows how our white wine observations are categorized into 7 ranks, from 3 to 9, lowest to highest quality. Because the more extreme categories are not well represented in the data set, I created a new ordinal variable with just three ranks: low, medium and premium.
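A three-level ordinal variable like this can be built with `cut()`. The sketch below assumes the split 3 to 5 = low, 6 = medium, 7 to 9 = premium; these break points are an assumption for illustration, since the exact thresholds are not stated here, and the `quality` vector is made up.

```r
# Made-up quality scores covering the full 3 to 9 range
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Assumed break points: (2,5] -> low, (5,6] -> medium, (6,9] -> premium
quality.type <- cut(quality,
                    breaks = c(2, 5, 6, 9),
                    labels = c("low", "medium", "premium"),
                    ordered_result = TRUE)

print(table(quality.type))
```

Using an ordered factor keeps the ranking low < medium < premium available to plots and summaries.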

Plot Two

## pf$quality.type: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.20    9.60    9.85   10.40   13.60 
## -------------------------------------------------------- 
## pf$quality.type: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## pf$quality.type: premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   10.70   11.50   11.42   12.40   14.20

Description Two

On average, alcohol content in white wines increases by a little under one percentage point with each step in quality, about 1.6 percentage points overall, from 9.85% alcohol in low quality white wines to 11.42% alcohol in premium quality white wines. The same is true for the median values: from 9.6% alcohol in low quality white wines to 11.5% alcohol in premium white wines.
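The differences between quality categories follow directly from the summary output above. The sketch below redoes that arithmetic in R; the vector names are chosen only for this illustration.

```r
# Mean and median alcohol per quality type, copied from the output above
mean_alcohol   <- c(low = 9.85, medium = 10.58, premium = 11.42)
median_alcohol <- c(low = 9.60, medium = 10.50, premium = 11.50)

# Change between adjacent categories (low -> medium, medium -> premium)
step_mean <- diff(mean_alcohol)

# Overall change from low to premium
total_mean   <- mean_alcohol[["premium"]] - mean_alcohol[["low"]]
total_median <- median_alcohol[["premium"]] - median_alcohol[["low"]]

print(round(step_mean, 2))
print(round(c(total_mean = total_mean, total_median = total_median), 2))
```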

Plot Three

Description Three

Alcohol and density have the highest correlation coefficients with the numeric value of quality: 0.44 for the first and -0.31 for the second. Also, density and alcohol have a strong negative linear correlation, with a correlation coefficient of -0.78. In relation to quality, there are more premium white wines with higher alcohol content and lower density, and more low quality white wines with higher density.


Reflection

First, I saw that the data set is not well balanced, and this can bias the results. The observations are not uniformly distributed across the quality categories. Therefore, I created a more inclusive categorical quality variable with only three levels: low, medium and premium.

Then, I saw that there are some main features of interest in our data set, like alcohol and density. The median and average alcohol content increase as the quality increases. Premium quality white wines tend to have more alcohol content, and low values of density are more common for high quality white wines.

What is interesting is that density and alcohol also share a strong negative linear relationship. As alcohol increases, density tends to decrease. Density also has a strong positive linear relationship with residual sugar, with a correlation coefficient of 0.84: as density increases, residual sugar tends to increase in white wines, from dryer white wines with lower density to sweeter white wines with higher density.

One limitation of this analysis is how the data set is structured: we may need more observations for the most extreme white wine quality categories. A more balanced data set may allow for a better comparison. There are also some outliers that can bias the results of the analysis. Finally, of the 11 chemical properties of white wine, only the ones that matter for quality should be taken into consideration for white wine quality prediction.